ABSTRACT

Summary: The initial steps in the analysis of next-generation
sequencing data can be automated by way of software 'pipelines'.
However, individual components depreciate rapidly because of the
evolving technology and analysis methods, often rendering entire
versions of production informatics pipelines obsolete. Constructing
pipelines from Linux bash commands enables the use of hot swappable
modular components as opposed to the more rigid program call wrapping
by higher level languages, as implemented in comparable published
pipelining systems.

Here we present Next Generation Sequencing ANalysis for Enterprises 
(NGSANE), a Linux-based, high-performance-computing-enabled 
framework that minimizes overhead for set up and processing of 
new projects, yet maintains full flexibility of custom scripting when 
processing raw sequence data. 

Availability and implementation: NGSANE is implemented in bash and
publicly available under BSD (3-Clause) licence via GitHub at
https://github.com/BauerLab/ngsane.
Contact: Denis.Bauer@csiro.au 

Supplementary information: Supplementary data are available at 
Bioinformatics online.

1 INTRODUCTION

The initial steps of analyzing next-generation sequencing (NGS)
data can be automated in standardized pipelines, e.g. the many
steps in SNP calling and RNA-Seq analysis (Anders et al., 2013).
This is critical, as further decreasing sequencing costs and
expanding use of replicates to assess biological variability (Auer
and Doerge, 2010) will substantially increase future study sizes,
therefore making the automated, documented and reproducible
processing of large numbers of samples across diverse projects using
high-performance computing (HPC) clusters paramount. Yet, because of
the constantly evolving technology, software and new application
areas, maintaining such production informatics pipelines can be
labor-intensive.

To address this issue, several software packages have been
published in recent years. However, currently available tools
are either web-based services, e.g. Galaxy (Goecks et al., 2010),
where even API-based access to the web service functionality is
not readily amenable to production-scale analysis practices, or
heavyweight frameworks written in user-friendly languages, such
as Snakemake and Nestly (Python) (Koster and Rahmann, 2012;
McCoy et al., 2013), GATK's Queue (Scala;
https://github.com/broadgsa/gatk/) or Bpipe (Groovy) (Sadedin et al.,
2012), which encapsulate the actual program call in
wrapper-script-specific syntax, hindering the development of pipeline
extensions.

NGSANE is a lightweight, Linux-based, HPC-enabled framework that
minimizes overhead for setup and processing of new projects, yet
maintains full flexibility of custom scripting for processing raw
sequence data. NGSANE allows end users and developers to construct
pipelines from call statements that can be tested on the command line
directly, without syntax alterations or wrapper script involvement.
This provides flexibility in software usage, a substantial advantage
when analysis pipelines are constantly revised as new algorithms are
developed. We describe NGSANE's aims below.

Data security and reusability. The framework separates
project-specific files from reference data, scripts and software
suites that are reusable in other projects (Fig. 1a). Access to
confidential data is handled transparently via the underlying Linux
permission system. The transaction between projects and framework
is facilitated by a project-specific configuration file that defines
paths to reference data as well as the analysis tasks to perform.
NGSANE supports systems with hierarchical storage management,
specifically Data Migration Facility, by ensuring files are online
when needed.
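
As a rough illustration of how a project-specific configuration file can tie a project to shared reference data, the following minimal sketch writes such a file and reads it back. The variable names (REFERENCE_FASTA, SAMPLE_DIRS, RUN_TASKS), the file name and the paths are hypothetical and do not reproduce NGSANE's actual configuration schema.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: names and paths are illustrative, not NGSANE's schema.

project_dir=$(mktemp -d)   # stands in for a project folder with restricted permissions

# Project-specific configuration: paths to shared reference data plus tasks to run.
cat > "$project_dir/config.txt" <<'EOF'
REFERENCE_FASTA="/shared/reference/genome.fasta"
SAMPLE_DIRS="sampleA sampleB"
RUN_TASKS="trim map"
EOF

# The framework side only needs to read the file to learn what to do.
. "$project_dir/config.txt"

# Print one line per (task, sample) pair that the configuration requests.
list_planned_work() {
  for task in $RUN_TASKS; do
    for sample in $SAMPLE_DIRS; do
      echo "task=$task sample=$sample reference=$REFERENCE_FASTA"
    done
  done
}

list_planned_work
```

Because the configuration is plain shell, access control can be left to ordinary file permissions on the project folder, while the reference data stays in a shared, reusable location.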

HPC and parallel execution. NGSANE supports Sun Grid Engine and
Portable Batch System job scheduling and can be operated in
different modes for development and production, thus enabling
flexible processing of NGS data. HPC job partitioning and submission
are independent of the program calls, therefore enabling new
technologies (e.g. Hadoop) to be incorporated.
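
One way to keep job submission separate from the program call is to pass the unchanged command to a small dispatcher that decides where it runs. The sketch below is an assumption about how such a separation could look, not NGSANE's code: run_module and its mode names are hypothetical, and the submit branch assumes a scheduler whose qsub reads the job script from standard input and accepts -N for the job name.

```bash
#!/usr/bin/env bash
# Hypothetical dispatcher: the program call is passed through untouched,
# only the execution mode changes.

run_module() {
  mode=$1
  job_name=$2
  shift 2
  case "$mode" in
    direct) "$@" ;;                                          # run locally, e.g. while developing
    dryrun) echo "would submit job '$job_name': $*" ;;       # show what would be queued
    submit) echo "$*" | qsub -N "$job_name" ;;               # simplistic: arguments are joined with spaces
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# The same call statement in two modes; 'echo' stands in for a real tool.
run_module direct map_sampleA echo "mapping sampleA"
run_module dryrun map_sampleA echo "mapping sampleA"
```

Supporting another backend would then mean adding a branch to the dispatcher, leaving the program calls themselves unchanged.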

Fig. 1. (a) Separation of project data from NGSANE core. (b) Workflow of NGSANE. (c) Example of automatically created project summary

Hot swapping and adaptability. Individual task blocks (e.g.
read mapping) are packaged in bash script modules, which can
be executed locally or on subsets to test module code, submission
parameters and compute node environment in stages. During
production, NGSANE automatically submits separate module
calls for each input file or set of files to the HPC queue. This
allows different existing modules, parameter settings or software
versions to be executed by changes to the project-specific
configuration file rather than the software code (hot swapping).
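
The idea of hot swapping can be sketched with a configuration variable that names the module to run. In this minimal example the two shell functions stand in for separate module scripts, and MAPPING_MODULE is a hypothetical variable name rather than an NGSANE setting.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: two interchangeable stand-ins for mapping modules.
map_with_toolA() { echo "toolA maps $1"; }
map_with_toolB() { echo "toolB maps $1"; }

# In practice this value would come from the project-specific configuration file.
MAPPING_MODULE="map_with_toolA"

# The pipeline code never names a concrete tool; it calls whatever is configured.
run_mapping() {
  "$MAPPING_MODULE" "$1"
}

run_mapping sample1.fastq

# Swapping the tool is a configuration change only; run_mapping stays untouched.
MAPPING_MODULE="map_with_toolB"
run_mapping sample1.fastq
```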

Reproducibility and checkpoint recovery. A full audit trail is
generated recording performed tasks, used reference data,
timestamps, software versions as well as HPC log files, including any
errors. NGSANE gracefully recovers from unsuccessfully executed
jobs, be it owing to failed commands, missing or incorrect input
or under-resourced HPC jobs, by cleanly restarting after the most
recent successfully executed checkpoint.
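
Checkpoint recovery of this kind can be approximated with marker files that record which steps have completed. The following is a minimal sketch under that assumption; run_step and the .done marker convention are hypothetical and are not taken from NGSANE, whose own checkpoint mechanism may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of checkpointing with marker files.

checkpoint_dir=$(mktemp -d)

# Run a named step unless its checkpoint exists; record success afterwards.
run_step() {
  name=$1
  shift
  marker="$checkpoint_dir/$name.done"
  if [ -e "$marker" ]; then
    echo "skip $name (checkpoint found)"
    return 0
  fi
  "$@" && touch "$marker"
}

# First attempt: trimming succeeds, mapping fails ('true'/'false' stand in for tools).
run_step trim true
run_step map false || echo "map failed; fix the problem and rerun"

# Second attempt: trimming is skipped, mapping is retried from its checkpoint.
run_step trim true
run_step map true
```

A failed step leaves no marker behind, so a rerun resumes at the first step that has not yet completed.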

Robust execution and full monitoring. In our experience, modular
workflows are executed in stages with optional human quality
control; NGSANE hence focuses on providing robust checkpointing
and intuitive report generation (Fig. 1b). However, workflows can be
fully automated by using NGSANE's control over HPC queuing systems
and by leveraging the customizable interfaces between modules when
submitting multiple dependent stages at once.

Automated project summary creation. NGSANE generates a
high-level summary (Project Card, Fig. 1b and c) to enable informed
decisions about the experimental success. This interactive
HTML report provides an access point for new lab members or
collaborators. Furthermore, the Project Card can be used as a
gold standard for software development when using a continuous
integration server.

Complete customization. NGSANE's configuration file contains
details about the submission system, typical HPC resource
allocations and location of third-party software. However, NGSANE's
credo is that every parameter can be overwritten; hence, default
parameters can be adjusted in the project-specific configuration
file to indicate different software versions, additional resources
or an altered output location. Additional parameters, such as a
specific HPC queue, or new parameters in a software release, can
be provided to each program via a special 'free form' variable in
the configuration file.
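
The override principle can be sketched as two layers of shell assignments, where the later, project-specific layer wins, plus a free-form variable that is appended to the program call. All names below (MAPPER_THREADS, MAPPER_EXTRA_PARAMS, the placeholder program mapper and its options) are hypothetical and only illustrate the mechanism.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: defaults first, project overrides second.

# Framework-wide defaults.
MAPPER_THREADS=4
MAPPER_EXTRA_PARAMS=""

# Project-specific configuration, read afterwards, so its values take precedence.
MAPPER_THREADS=8
MAPPER_EXTRA_PARAMS="--new-option 5"

# Assemble the call; 'mapper' is a placeholder and is only printed, not executed.
build_mapper_call() {
  # MAPPER_EXTRA_PARAMS is left unquoted on purpose so it splits into separate words.
  echo mapper --threads "$MAPPER_THREADS" $MAPPER_EXTRA_PARAMS "$1"
}

build_mapper_call reads.fastq
```

With a free-form variable, an option introduced by a new software release can be passed through without changing the module code.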

Repeated calls. As stated by McCoy et al. (2013), pipelines
often have to be rerun on the full or a subset of the data with
possibly altered parameter settings. NGSANE facilitates and
documents this by allowing multiple (automatically created)
configuration files.

Knowledge transfer. NGSANE provides a unified framework
(i.e. folder structure) for processing data from different
experimental protocols. This allows co-investigators and reviewers
to easily understand and reproduce work from NGSANE's log and
report files.

NGSANE is open source and available via GitHub. Currently
implemented workflows include those for adapter trimming, read
mapping, peak calling, motif discovery, transcript assembly,
variant calling and chromatin conformation analysis. These
workflows use publicly available published software, yet allow the
end user to add their own code and create new workflows as
required. NGSANE is also available as an Amazon Machine Image
and can be deployed to the Amazon Elastic Compute Cloud
(EC2) using StarCluster to allow on-demand processing of
samples without requiring software installation or HPC
maintenance.